In [1]:
# coding: utf-8
import os
from cheshire3.baseObjects import Session
from cheshire3.document import StringDocument
from cheshire3.internal import cheshire3Root
from cheshire3.server import SimpleServer
session = Session()
session.database = 'db_dickens'
serv = SimpleServer(session, os.path.join(cheshire3Root, 'configs', 'serverConfig.xml'))
db = serv.get_object(session, session.database)
qf = db.get_object(session, 'defaultQueryFactory')
resultSetStore = db.get_object(session, 'resultSetStore')
idxStore = db.get_object(session, 'indexStore')
When using the any search function to search for two different terms, the results are wrong.
Problem 1: searching for fog OR dense is not the same as dense OR fog.
Problem 2: the counts for fog OR dense are off.
Currently, there are 150 results for fog OR dense and 221 for dense OR fog, but there should be many more (142 or 144 if one counts compound nouns).
In [2]:
# This is the query that is currently being used.
# The count is the number of chapters
query = qf.get_query(session, """
((c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "fog") or c3.chapter-idx = "dense")
""")
result_set = db.search(session, query)
print len(result_set)
In [3]:
# To get a more specific count, one also needs to include the number of hits
# in the different chapters
def count_total(result_set):
    """
    Helper function to count the total number of hits
    in the search results
    """
    count = 0
    for result in result_set:
        # each result is one chapter; proxInfo holds one entry per hit in it
        count += len(result.proxInfo)
    return count
In [4]:
count_total(result_set)
Out[4]:
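The difference between the chapter count and the hit count can be seen without a database. The following is a minimal sketch: `FakeResult` is a hypothetical stand-in for a result-set item, and the contents of its `proxInfo` lists are made up. The only thing assumed about the real objects is what `count_total` itself relies on, namely that `proxInfo` has one entry per hit.

```python
class FakeResult(object):
    """Hypothetical stand-in for one result (one chapter); not a Cheshire3 class."""
    def __init__(self, prox_info):
        # proxInfo is modelled as a plain list with one (made-up) entry per hit
        self.proxInfo = prox_info


def count_hits(result_set):
    """Same logic as count_total: sum the hits over all chapters."""
    return sum(len(result.proxInfo) for result in result_set)


# Two chapters: the first with two hits, the second with one hit.
fake_result_set = [FakeResult(["hit-a", "hit-b"]), FakeResult(["hit-c"])]

chapters = len(fake_result_set)      # what len(result_set) reports
hits = count_hits(fake_result_set)   # what count_total reports
```

Here `chapters` is 2 while `hits` is 3, which is why the two numbers should not be compared directly.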
In [5]:
def try_query(query):
    """
    Another helper function to take a query and return
    the total number of hits
    """
    # parse the CQL string, run the search, then count hits (not chapters)
    query = qf.get_query(session, query)
    result_set = db.search(session, query)
    return count_total(result_set)
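This query gets wrong results because the OR clause is poorly constructed: only the first term is restricted to the dickens subcorpus, while the second term is searched in the whole chapter index.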
In [6]:
try_query("""
((c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "dense") or c3.chapter-idx = "fog")
"""
)
Out[6]:
Properly structuring the OR clause removes the problem of getting different results for
fog OR dense
dense OR fog
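Why the grouping matters can be illustrated with plain Python sets. This is a toy model, not Cheshire3 code: the chapter ids and the membership of each set are invented, and `&` and `|` stand in for the CQL `and` and `or`.

```python
# Invented chapter ids: 1-3 belong to the "dickens" subcorpus, 4-5 do not.
dickens = {1, 2, 3}
fog = {1, 4}      # chapters containing "fog"
dense = {2, 5}    # chapters containing "dense"

# Original grouping: (subcorpus AND first term) OR second term.
# The second term escapes the subcorpus restriction.
fog_or_dense = (dickens & fog) | dense
dense_or_fog = (dickens & dense) | fog

# Restructured grouping: subcorpus AND (first term OR second term).
grouped_a = dickens & (fog | dense)
grouped_b = dickens & (dense | fog)
```

With the original grouping the two orders give `{1, 2, 5}` and `{1, 2, 4}`, so the result depends on which term comes first. With the restructured grouping both orders give `{1, 2}`.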
Option 1
In [7]:
try_query("""
(c3.subcorpus-idx all "dickens" and/cql.proxinfo (c3.chapter-idx = "dense" or c3.chapter-idx = "fog"))
"""
)
Out[7]:
Option 2
In [8]:
try_query("""
(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx any "dense fog")
"""
)
Out[8]:
In [9]:
try_query("""
(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx any "fog dense")
"""
)
Out[9]:
Option 3: the verbose one
In [10]:
try_query("""
((c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "dense") or
(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "fog"))
"""
)
Out[10]:
To really get the right results, though, one should not just use or (or any), but rather or/proxinfo (or any/cql.proxinfo).
In [11]:
try_query("""
(c3.subcorpus-idx all "dickens" and/proxinfo (c3.chapter-idx = "dense" or/proxinfo c3.chapter-idx = "fog"))
"""
)
Out[11]:
Or in its simpler form:
In [12]:
try_query("""
(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx any/proxinfo "fog dense")
"""
)
Out[12]:
This does not seem to be affected by whether the modifier is written as proxinfo or as cql.proxinfo (the prefix is a cql specification, if I am not wrong).
In [13]:
try_query("""
(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx any/cql.proxinfo "fog dense")
"""
)
Out[13]:
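To avoid retyping the query for every pair of terms, the string can be assembled by a small helper. This is only a sketch: `build_any_query` and its parameters are hypothetical names, not part of Cheshire3, and the helper does no escaping, so it assumes the terms contain no double quotes.

```python
def build_any_query(terms, subcorpus="dickens", index="c3.chapter-idx"):
    """
    Hypothetical helper: build the any/cql.proxinfo query used above
    for a list of search terms. No escaping is done on the terms.
    """
    term_string = " ".join(terms)
    return ('(c3.subcorpus-idx all "{0}" and/cql.proxinfo '
            '{1} any/cql.proxinfo "{2}")').format(subcorpus, index, term_string)


query_string = build_any_query(["fog", "dense"])
```

The resulting `query_string` is the same text as the query in the cell above, so it could be passed to `try_query` unchanged.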
The counts are now correct:
In [14]:
dense = try_query("""(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "dense")""")
print dense
In [15]:
fog = try_query("""(c3.subcorpus-idx all "dickens" and/cql.proxinfo c3.chapter-idx = "fog")""")
print fog
In [16]:
dense + fog
Out[16]:
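The comparison above can be turned into a reusable check. The sketch below assumes that no single hit is counted for more than one term, which should hold for distinct single words; `is_additive` is a hypothetical helper, and the counts in the example are made up rather than taken from the database.

```python
def is_additive(count_for, terms, combined_count):
    """
    Hypothetical check: does the combined count equal the sum of the
    per-term counts? count_for is any callable mapping a term to its count.
    """
    return combined_count == sum(count_for(term) for term in terms)


# Made-up numbers, for illustration only.
made_up_counts = {"fog": 3, "dense": 2}
consistent = is_additive(made_up_counts.get, ["fog", "dense"], 5)
inconsistent = is_additive(made_up_counts.get, ["fog", "dense"], 4)
```

In the notebook, `count_for` could be a small wrapper around `try_query` that builds the single-term query, and `combined_count` the result of the any/cql.proxinfo query.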